ClusteringAssignment

Assignment 9: Clustering

As usual, please render the HTML file and submit it on Canvas. For this assignment, we will use a dataset containing obesity data from individuals in several countries. The file is already uploaded to RStudio Cloud as “Obesity_data.csv”. Read it into R as a data frame named “obesity”.

library(tidyverse)  # provides read_csv(), %>%, select(), and ggplot()
obesity = read_csv("Obesity_data.csv")
Rows: 2111 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Gender, family_history_with_overweight, SMOKE
dbl (3): Age, Height, Weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(obesity)
spc_tbl_ [2,111 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Gender                        : chr [1:2111] "Female" "Female" "Male" "Male" ...
 $ Age                           : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
 $ Height                        : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
 $ Weight                        : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
 $ family_history_with_overweight: chr [1:2111] "yes" "yes" "yes" "no" ...
 $ SMOKE                         : chr [1:2111] "no" "yes" "no" "no" ...
 - attr(*, "spec")=
  .. cols(
  ..   Gender = col_character(),
  ..   Age = col_double(),
  ..   Height = col_double(),
  ..   Weight = col_double(),
  ..   family_history_with_overweight = col_character(),
  ..   SMOKE = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
summary(obesity)
    Gender               Age            Height          Weight      
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59  
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
                    Max.   :61.00   Max.   :1.980   Max.   :173.00  
 family_history_with_overweight    SMOKE          
 Length:2111                    Length:2111       
 Class :character               Class :character  
 Mode  :character               Mode  :character  
                                                  
                                                  
                                                  

Several of the variables are binary or categorical. Please consider dropping these variables before clustering, but make sure to retain the original data frame. Because k-means clustering requires numeric data, we will only retain the numeric variables. Once we form the clusters, you will append them back to the original data frame.

Q1 Is there any missingness in the dataset? Recall that we must address any missingness prior to clustering. If you discover any missingness, use row-wise deletion to eliminate it.

summary(obesity)
    Gender               Age            Height          Weight      
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59  
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
                    Max.   :61.00   Max.   :1.980   Max.   :173.00  
 family_history_with_overweight    SMOKE          
 Length:2111                    Length:2111       
 Class :character               Class :character  
 Mode  :character               Mode  :character  
                                                  
                                                  
                                                  
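  • There is no missing data: the summary shows no NA counts for Age, Height, or Weight, which are the only variables used for clustering.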
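Because `summary()` does not report `NA` counts for character columns, an explicit count per column is a more direct check. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from the obesity data.

```r
# Hypothetical toy data with one missing value in each of two columns
toy = data.frame(
  Gender = c("Female", NA, "Male", "Male"),
  Age    = c(21, 23, NA, 27),
  Weight = c(64, 77, 87, 53)
)

# Count missing values per column (works for character and numeric columns)
na_counts = colSums(is.na(toy))
print(na_counts)

# Row-wise deletion: keep only rows with no missing value in any column
toy_complete = na.omit(toy)
print(nrow(toy_complete))
```

On the assignment data, the same check would be `colSums(is.na(obesity))`; if every count is zero, no deletion is needed.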

Q2 Create a new data frame to hold scaled values of all of the variables in the obesity data frame. Note: You do not need to exclude any variables from the scaling.

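# Keep only the numeric columns, standardize them, and convert the matrix from scale() back to a data frame
obesity_s = obesity %>% select(Age, Height, Weight) %>% scale() %>% as.data.frame()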
summary(obesity_s)
      Age              Height             Weight       
 Min.   :-1.6251   Min.   :-2.69737   Min.   :-1.8169  
 1st Qu.:-0.6879   1st Qu.:-0.76821   1st Qu.:-0.8061  
 Median :-0.2418   Median :-0.01263   Median :-0.1369  
 Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.2659   3rd Qu.: 0.71579   3rd Qu.: 0.7959  
 Max.   : 5.7812   Max.   : 2.98294   Max.   : 3.2994  
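Every scaled column above has a mean of 0 because `scale()` subtracts the column mean and divides by the sample standard deviation. A minimal sketch that checks this by hand on a few made-up values (the vector `x` is hypothetical):

```r
# Hypothetical toy values (not a sample drawn for analysis)
x = c(64, 56, 77, 87)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual = (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops its attributes
z_scale = as.numeric(scale(x))

# Both approaches should agree, with mean 0 and standard deviation 1
print(all.equal(z_manual, z_scale))
print(round(mean(z_scale), 10))
print(sd(z_scale))
```

Standardizing matters here because Weight (tens of kilograms) would otherwise dominate the distance calculations over Height (fractions of a unit).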

Q3 Use the NbClust function to determine the “optimal” number of clusters for this dataset.

# library(NbClust)
# nc = NbClust(obesity_s, min.nc = 2, max.nc = 10, method = "kmeans",
#              index = "all", alphaBeale = 0.1)
# table(nc$Best.nc[1, ])
  • 3 clusters is the best choice
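As a rough cross-check on the number of clusters, the elbow method is a different and simpler technique than NbClust: it tracks how the total within-cluster sum of squares falls as k grows. The sketch below uses the built-in `iris` data as a stand-in, since the obesity file is not available here; the range 1 to 10 and `nstart = 25` are illustrative choices.

```r
# Built-in iris data stands in for the scaled obesity data (assumption)
x = scale(iris[, 1:4])

set.seed(123)
# Total within-cluster sum of squares for k = 1, ..., 10
wss = sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# The "elbow" is the k after which the values stop dropping sharply
print(round(wss, 1))
```

Plotting `wss` against k makes the elbow easier to see. The elbow is a visual heuristic, so it may not always agree with the majority vote from NbClust.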

Q4 Using the number of clusters that you identified in Question 3, create the clusters. Please set the random number seed to ‘123’.

set.seed(123)
fit = kmeans(obesity_s, 3)
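The seed matters because `kmeans()` picks its starting centers at random, so repeated runs can label or split the clusters differently. A minimal sketch of this, again using the built-in `iris` data as a stand-in (an assumption, not the assignment data):

```r
# Stand-in data (assumption): scaled numeric columns of iris
x = scale(iris[, 1:4])

# Two fits with the same seed start from the same random centers
set.seed(123)
fit_a = kmeans(x, centers = 3)
set.seed(123)
fit_b = kmeans(x, centers = 3)

# Same seed, same starting centers, same assignment
print(identical(fit_a$cluster, fit_b$cluster))

# Number of observations per cluster and the cluster centers (in scaled units)
print(fit_a$size)
print(fit_a$centers)
```

The `size` and `centers` components give a quick check that no cluster is nearly empty and show where each cluster sits relative to the overall mean of 0.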

Q5 Attach the clustering you created in Question 4 back to the ORIGINAL (not scaled) data frame.

summary(fit$cluster)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.064   3.000   3.000 
obesity$cluster = fit$cluster
summary(obesity)
    Gender               Age            Height          Weight      
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59  
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
                    Max.   :61.00   Max.   :1.980   Max.   :173.00  
 family_history_with_overweight    SMOKE              cluster     
 Length:2111                    Length:2111        Min.   :1.000  
 Class :character               Class :character   1st Qu.:1.000  
 Mode  :character               Mode  :character   Median :2.000  
                                                   Mean   :2.064  
                                                   3rd Qu.:3.000  
                                                   Max.   :3.000  
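The numeric summary of the cluster column (for example, a mean of 2.064) is not meaningful, because the labels are categories rather than quantities. A more useful view is the size of each cluster and the per-cluster mean of each variable in its original units. A minimal sketch, using the built-in `iris` data as a stand-in for the original data frame (an assumption):

```r
# Stand-in for the original, unscaled data frame (assumption)
dat = iris[, 1:4]
set.seed(123)
fit = kmeans(scale(dat), centers = 3)

# Store the label as a factor so it is treated as a category, not a number
dat$cluster = factor(fit$cluster)

# Cluster sizes, then the mean of each variable per cluster in original units
print(table(dat$cluster))
profile = aggregate(. ~ cluster, data = dat, FUN = mean)
print(profile)
```

For the assignment data, the equivalent calls would be `table(obesity$cluster)` and an `aggregate()` over Age, Height, and Weight.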

Q6 Using the clustering you attached in Question 5, create the following plots (fill color by cluster): a) height versus weight b) age versus height c) age versus weight

library(plotly)  # provides ggplotly()
# Treat cluster as a category so each cluster gets its own discrete color
plot1 = ggplot(obesity, aes(x = Height, y = Weight, color = factor(cluster))) + geom_point()
ggplotly(plot1)
plot2 = ggplot(obesity, aes(x = Age, y = Height, color = factor(cluster))) + geom_point()
ggplotly(plot2)
plot3 = ggplot(obesity, aes(x = Age, y = Weight, color = factor(cluster))) + geom_point()
ggplotly(plot3)

Q7 Do there appear to be patterns in the data that might suggest obesity?

  • It seems that the younger you are, the more likely you are to be obese. It also seems that the taller you are, the more likely you are to be obese. Overall, the clusters suggest that younger, taller individuals are more likely to be obese.
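One way to test a reading like this is to compute body mass index (BMI) per cluster, since obesity is commonly defined as a BMI of 30 or higher. The sketch below uses made-up rows and assumes Height is in metres and Weight in kilograms, which the value ranges in the data suggest but the assignment does not state. The data frame `toy` and its cluster labels are hypothetical.

```r
# Hypothetical rows; heights assumed to be in metres, weights in kilograms
toy = data.frame(
  Height  = c(1.62, 1.80, 1.78, 1.70),
  Weight  = c(64, 87, 110, 95),
  cluster = factor(c(1, 1, 2, 2))
)

# Body mass index: weight (kg) divided by height (m) squared
toy$BMI = toy$Weight / toy$Height^2

# Share of each cluster at or above the BMI >= 30 obesity threshold
obese_share = tapply(toy$BMI >= 30, toy$cluster, mean)
print(round(toy$BMI, 1))
print(obese_share)
```

Applied to the obesity data frame, a cluster with a high share would point to an obesity pattern more directly than age or height alone.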